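Visual relations form the basis for understanding our compositional world, because the relationships between visual objects capture key information in a scene. It is therefore advantageous to learn relations automatically from data, since learning with predefined labels cannot capture all possible relations. However, current relation learning methods typically require supervision and are not designed to generalize to scenes with more complex relational structures than those seen during training. Here, we introduce ViRel, a method for unsupervised discovery and learning of visual relations using graph-level analogy. In a setting where the scenes within a task share the same underlying relational subgraph structure, our learning method, which contrasts isomorphic and non-isomorphic graphs, discovers relations across tasks in an unsupervised manner. Once the relations are learned, ViRel can retrieve the shared relational graph structure for each task by parsing the predicted relational structure. Using datasets based on grid-world and the Abstract Reasoning Corpus, we show that our method achieves 95% accuracy in relation classification, discovers the relational graph structure for most tasks, and further generalizes to unseen tasks with more complex relational structures.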
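Humans have a remarkable ability to recognize and acquire novel visual concepts in a zero-shot manner. Given a high-level, symbolic description of previously learned visual concepts and their relations, humans can recognize novel concepts without seeing any examples. Moreover, they can acquire new concepts by parsing and communicating symbolic structures using learned visual concepts and relations. Endowing machines with these capabilities is pivotal to improving their generalization at inference time. In this work, we introduce Zero-shot Concept Recognition and Acquisition (ZeroC), a neuro-symbolic architecture that can recognize and acquire novel concepts in a zero-shot manner. ZeroC represents a concept as a graph of constituent concept models (as nodes) and their relations (as edges). To allow inference-time composition, we employ energy-based models (EBMs) to model concepts and relations. We design the ZeroC architecture so that it allows a one-to-one mapping between the symbolic graph structure of a concept and its corresponding EBM, which for the first time allows acquiring new concepts, communicating their graph structure, and applying them to classification and detection tasks (even across domains) at inference time. We introduce algorithms for learning and inference with ZeroC. We evaluate ZeroC on a challenging grid-world dataset designed to probe zero-shot concept recognition and acquisition, and demonstrate its capability.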
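Simulating the time evolution of partial differential equations (PDEs) of large-scale systems is crucial in many scientific and engineering domains, such as fluid dynamics, weather forecasting, and their inverse optimization problems. However, both classical solvers and recent deep-learning-based surrogate models are typically very computationally intensive because of their local evolution: they need to update the state of every discretized cell at every time step during inference. Here, we develop Latent Evolution of PDEs (LE-PDE), a simple, fast, and scalable method to accelerate the simulation and inverse optimization of PDEs. LE-PDE learns a compact, global representation of the system and efficiently evolves it entirely in the latent space with a learned latent evolution model. LE-PDE achieves its speed-up by having a much smaller latent dimension to update during long rollouts than updating in the input space would require. We introduce new learning objectives to learn such latent dynamics effectively and ensure long-term stability. We further introduce techniques for speeding up the inverse optimization of PDE boundary conditions via backpropagation in the latent space, together with an annealing technique to address the non-differentiability and sparse interaction of boundary conditions. We test our method on a 1D benchmark of nonlinear PDEs, on 2D Navier-Stokes flows into the turbulent phase, and on an inverse optimization of boundary conditions in 2D Navier-Stokes flow. Compared with state-of-the-art deep-learning-based surrogate models and other strong baselines, we demonstrate a 128x reduction in the dimensions to update and a 15x improvement in speed, while achieving competitive accuracy.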
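The encode, evolve, decode pattern described in this abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the functions `encode`, `evolve`, `decode`, and `latent_rollout` are hypothetical, hand-written stand-ins for the learned networks, and the sizes (1024 cells, 8 latent values) are chosen arbitrarily rather than taken from the paper. The point is only that each rollout step touches the small latent vector instead of every discretized cell.

```python
def encode(state, latent_dim):
    """Toy encoder: average contiguous blocks of cells.

    Stand-in for a learned encoder (assumption, not the paper's model).
    """
    block = len(state) // latent_dim
    return [sum(state[i * block:(i + 1) * block]) / block
            for i in range(latent_dim)]


def evolve(z, decay=0.9):
    """Toy latent dynamics: a fixed linear decay.

    Stand-in for a learned latent evolution model (assumption).
    """
    return [decay * zi for zi in z]


def decode(z, num_cells):
    """Toy decoder: repeat each latent value over its block of cells."""
    block = num_cells // len(z)
    return [zi for zi in z for _ in range(block)]


def latent_rollout(state, latent_dim, steps):
    """Encode once, step only in latent space, decode at the end."""
    z = encode(state, latent_dim)
    for _ in range(steps):
        z = evolve(z)  # updates latent_dim values, not len(state) cells
    return decode(z, len(state))


# Illustrative sizes only: 1024 input cells compressed to 8 latent values.
state = [1.0] * 1024
prediction = latent_rollout(state, latent_dim=8, steps=3)
reduction = len(state) // 8  # ratio of values updated per step
```

In this toy setting the per-step update cost scales with the latent dimension, which is the source of the speed-up the abstract describes; a real model would replace the three stand-in functions with trained networks.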
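Subsurface simulations use computational models to predict the flow of fluids (e.g., oil, water, gas) through porous media. These simulations are pivotal in industrial applications such as petroleum production, where fast and accurate models are needed for high-stakes decision making, for example in well placement optimization and field development planning. Classical finite-difference numerical simulators require massive computational resources to model large-scale, real-world reservoirs. Alternatively, streamline simulators and data-driven surrogate models are computationally more efficient because they rely on approximate physics models, but they are insufficient for modeling complex reservoir dynamics at scale. Here, we introduce the Hybrid Graph Network Simulator (HGNS), a data-driven surrogate model for learning reservoir simulations of 3D subsurface fluid flows. To model complex reservoir dynamics at both local and global scales, HGNS consists of a subsurface graph neural network (SGNN) that models the evolution of fluid flows and a 3D-U-Net that models the evolution of pressure. HGNS is able to scale to grids with millions of cells per time step, two orders of magnitude higher than previous surrogate models, and can accurately predict the fluid flow for tens of time steps (years into the future). Using an industry-standard subsurface flow dataset (SPE-10) with 1.1 million cells, we demonstrate that HGNS reduces inference time by up to 18 times compared with standard subsurface simulators, and that it outperforms other learning-based models by reducing long-term prediction errors by up to 21%.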
Federated learning allows multiple clients to collaboratively train a model without exchanging their data, thus preserving data privacy. Unfortunately, it suffers significant performance degradation when data is heterogeneous across clients. Common solutions in local training involve designing a specific auxiliary loss to regularize weight divergence or feature inconsistency. However, we discover that these approaches fall short of the expected performance because they ignore a vicious cycle between classifier divergence and feature mapping inconsistency across clients, such that client models are updated in inconsistent feature spaces with diverged classifiers. We then propose a simple yet effective framework named Federated learning with Feature Anchors (FedFA) to align the feature mappings and calibrate the classifiers across clients during local training, so that client models are updated in a shared feature space with consistent classifiers. We demonstrate that this modification brings similar classifiers and a virtuous cycle between feature consistency and classifier similarity across clients. Extensive experiments show that FedFA significantly outperforms the state-of-the-art federated learning algorithms on various image classification datasets under label and feature distribution skews.
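In this paper, we study the problem of skeleton-based action recognition, which poses unique challenges in learning representations that transfer from base classes to novel classes, particularly for fine-grained actions. Existing meta-learning frameworks typically rely on body-level representations in the spatial dimension, which limits their ability to generalize and capture subtle visual differences in a fine-grained label space. To overcome this limitation, we propose a part-aware prototypical representation for one-shot skeleton-based action recognition. Our method captures skeleton motion patterns at two distinct spatial levels: one for the global context among all body joints, referred to as the body level, and the other attending to local spatial regions of body parts, referred to as the part level. We also devise a class-agnostic attention mechanism to highlight the important parts for each action class. Specifically, we develop a part-aware prototypical graph network consisting of three modules: a cascaded embedding module for our dual-level modeling, an attention-based part fusion module that fuses parts and generates part-aware prototypes, and a matching module that performs classification with the part-aware representations. We demonstrate the effectiveness of our method on two public skeleton-based action recognition datasets: NTU RGB+D 120 and NW-UCLA.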
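Modeling various spatio-temporal dependencies is the key to recognizing human actions in skeleton sequences. Most existing methods rely excessively on the design of traversal rules or graph topologies to exploit the dependencies of dynamic joints, which is inadequate for reflecting the relationships between distant yet important joints. Furthermore, because the operations they adopt are local, important long-range temporal information is not well explored in existing work. To address this issue, we propose LSTA-Net, a novel Long Short-Term Spatio-Temporal Aggregation Network that can effectively capture long- and short-range dependencies in a spatio-temporal manner. We design our model as a purely factorized architecture that alternately performs spatial feature aggregation and temporal feature aggregation. To improve the feature aggregation, a channel-wise attention mechanism is also designed and employed. Extensive experiments on three public benchmark datasets show that our approach can capture both long- and short-range dependencies in the spatial and temporal domains, yielding better results than other state-of-the-art methods. Code is available at https://github.com/tailin1009/lsta-net.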
A recent study has shown a phenomenon called neural collapse in that the within-class means of features and the classifier weight vectors converge to the vertices of a simplex equiangular tight frame at the terminal phase of training for classification. In this paper, we explore the corresponding structures of the last-layer feature centers and classifiers in semantic segmentation. Based on our empirical and theoretical analysis, we point out that semantic segmentation naturally brings contextual correlation and imbalanced distribution among classes, which breaks the equiangular and maximally separated structure of neural collapse for both feature centers and classifiers. However, such a symmetric structure is beneficial to discrimination for the minor classes. To preserve these advantages, we introduce a regularizer on feature centers to encourage the network to learn features closer to the appealing structure in imbalanced semantic segmentation. Experimental results show that our method can bring significant improvements on both 2D and 3D semantic segmentation benchmarks. Moreover, our method ranks 1st and sets a new record (+6.8% mIoU) on the ScanNet200 test leaderboard. Code will be available at https://github.com/dvlab-research/Imbalanced-Learning.
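For readers unfamiliar with the structure mentioned above, a simplex equiangular tight frame for K classes can be written in a standard closed form: the columns of sqrt(K/(K-1)) (I - (1/K) 1 1^T), optionally rotated. The sketch below builds that matrix in plain Python, taking the rotation to be the identity (an assumption made for simplicity); the helper names `simplex_etf` and `dot` are hypothetical and are not taken from the paper's code. It illustrates the two defining properties: every vertex has unit norm, and every pair of vertices has the same inner product, -1/(K-1).

```python
import math


def simplex_etf(num_classes):
    """Return K unit vectors in R^K that form a simplex equiangular tight frame.

    Uses the closed form sqrt(K / (K - 1)) * (I - (1/K) * ones), with the
    optional rotation taken as the identity (assumption for illustration).
    """
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


K = 4
vertices = simplex_etf(K)

# Each vertex has unit length.
norms = [math.sqrt(dot(v, v)) for v in vertices]

# Every pair of distinct vertices shares the same inner product, -1 / (K - 1),
# which is the most negative value achievable for K equiangular unit vectors.
pairwise = [
    dot(vertices[i], vertices[j])
    for i in range(K)
    for j in range(i + 1, K)
]
```

The abstract's observation is that class imbalance and contextual correlation in semantic segmentation pull learned feature centers away from this symmetric arrangement, which motivates regularizing them back toward it.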
Weakly-supervised object localization aims to indicate the category as well as the scope of an object in an image given only image-level labels. Most existing works are based on Class Activation Mapping (CAM) and endeavor to enlarge the discriminative area inside the activation map to perceive the whole object, yet they ignore the co-occurrence confounder of the object and its context (e.g., fish and water), which makes it hard for the model to distinguish object boundaries. In addition, the use of CAM creates a dilemma: classification and localization always suffer from a performance gap and cannot reach their highest accuracy simultaneously. In this paper, we propose a causal knowledge distillation method, dubbed KD-CI-CAM, to address these two under-explored issues in one go. More specifically, we tackle the co-occurrence context confounder problem via causal intervention (CI), which explores the causalities among image features, contexts, and categories to eliminate the biased object-context entanglement in the class activation maps. Based on the de-biased object feature, we additionally propose a multi-teacher causal distillation framework to balance the absorption of classification knowledge and localization knowledge during model training. Extensive experiments on several benchmarks demonstrate the effectiveness of KD-CI-CAM in learning clear object boundaries from confounding contexts and in addressing the dilemma between classification and localization performance.
Witnessing the impressive achievements of pre-training techniques on large-scale data in the fields of computer vision and natural language processing, we wonder whether this idea could be adapted in a grab-and-go spirit to mitigate the sample inefficiency problem in visuomotor driving. Given the highly dynamic and variant nature of the input, the visuomotor driving task inherently lacks view and translation invariance, and the visual input contains massive amounts of information irrelevant to decision making, which makes predominant pre-training approaches from general vision less suitable for the autonomous driving task. To this end, we propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward fully self-supervised framework curated for policy pre-training in visuomotor driving. We aim to learn policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos. PPGeo is performed in two stages to support effective self-supervised training. In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input. In the second stage, the visual encoder learns a driving policy representation by predicting the future ego-motion and optimizing with the photometric error based on the current visual observation only. As such, the pre-trained visual encoder is equipped with rich driving-policy-related representations and is thereby competent for multiple visuomotor driving tasks. Extensive experiments covering a wide span of challenging scenarios demonstrate the superiority of our proposed approach, with improvements ranging from 2% to even over 100% with very limited data. Code and models will be available at https://github.com/OpenDriveLab/PPGeo.